Rework handling general entity references (`&entity;`) #766

Mingun · 2024-06-21T16:11:50Z

This is a big change in the handling of general entity references and character references. I am opening the PR early to get feedback.

With these changes we can correctly parse the document

<!DOCTYPE root [
  <!ENTITY root "<root/>">
]>
&root;

as equivalent to the normalized document

<root/>

The updated custom_entities example shows how the specification's requirements for parsed general entities could be implemented. The Serde deserializer has not been updated yet, because that part is not trivial and will probably be done in another PR.

Of course, such a change probably makes performance worse; I have not measured the impact yet.

Closes #667

codecov-commenter · 2024-06-30T13:29:19Z

Codecov Report

Attention: Patch coverage is 61.05033% with 178 lines in your changes missing coverage. Please review.

Project coverage is 60.18%. Comparing base (a9391f3) to head (0631d47).
Report is 15 commits behind head on master.

Files with missing lines	Patch %	Lines
examples/custom_entities.rs	0.00%	117 Missing ⚠️
src/events/mod.rs	38.98%	36 Missing ⚠️
src/reader/buffered_reader.rs	83.09%	12 Missing ⚠️
benches/macrobenches.rs	0.00%	4 Missing ⚠️
src/de/mod.rs	88.00%	3 Missing ⚠️
src/errors.rs	0.00%	3 Missing ⚠️
benches/microbenches.rs	0.00%	1 Missing ⚠️
src/reader/slice_reader.rs	97.56%	1 Missing ⚠️
src/writer/async_tokio.rs	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #766      +/-   ##
==========================================
- Coverage   60.21%   60.18%   -0.03%     
==========================================
  Files          41       41              
  Lines       16021    16409     +388     
==========================================
+ Hits         9647     9876     +229     
- Misses       6374     6533     +159

Flag	Coverage Δ
unittests	`60.18% <61.05%> (-0.03%)`	⬇️

Mingun · 2024-06-30T13:45:09Z

I finished work on the base part of the entity support. This PR adds a new Event::GeneralRef together with a new BytesRef struct, which represents any &...; reference, including entity references and character references. Character references can be resolved by calling BytesRef::resolve_char_ref(); entity references can be resolved by mapping the content of BytesRef to replacement text. Both usages are shown in the updated custom_entities example.
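To make the two resolution paths concrete, here is a minimal standalone sketch that uses only the standard library. It does not call the crate: `Resolved`, `resolve`, and this free-standing `resolve_char_ref` are hypothetical names for illustration, and the real BytesRef::resolve_char_ref() may differ in signature and in how strictly it validates input.

```rust
use std::collections::HashMap;

/// Result of resolving the content between `&` and `;` (hypothetical type).
#[derive(Debug, PartialEq)]
enum Resolved<'a> {
    /// A character reference such as `#65` or `#x41`.
    Char(char),
    /// A general entity reference together with its replacement text.
    Entity(&'a str),
    /// Neither a valid character reference nor a known entity.
    Unknown,
}

/// Parses `#NNN` (decimal) or `#xHHH` (hexadecimal) into a `char`.
/// Assumption: this only checks that the code point is a valid Rust `char`;
/// it does not enforce the XML `Char` production.
fn resolve_char_ref(content: &str) -> Option<char> {
    let digits = content.strip_prefix('#')?;
    let (digits, radix) = match digits.strip_prefix('x') {
        Some(hex) => (hex, 16),
        None => (digits, 10),
    };
    // Reject empty input and signs, which `from_str_radix` would tolerate.
    if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }
    char::from_u32(u32::from_str_radix(digits, radix).ok()?)
}

/// Resolves either kind of reference; `entities` maps names to replacement text.
fn resolve<'a>(content: &str, entities: &HashMap<&str, &'a str>) -> Resolved<'a> {
    if content.starts_with('#') {
        return match resolve_char_ref(content) {
            Some(c) => Resolved::Char(c),
            None => Resolved::Unknown,
        };
    }
    match entities.get(content).copied() {
        Some(text) => Resolved::Entity(text),
        None => Resolved::Unknown,
    }
}
```

With a map containing `msg` mapped to `hello world`, `resolve("#x41", &map)` yields `Resolved::Char('A')` and `resolve("msg", &map)` yields `Resolved::Entity("hello world")`.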

dralley · 2024-06-30T17:16:22Z

I won't have a chance to fully review this for a couple of days. Quick question though, am I correct in thinking that this PR will mean that any time a text block contains one or more entity references, instead of the developer receiving one Event::Text containing everything between the opening and closing tags, they will receive a series of Event::Text and Event::GeneralRef which they will then need to merge back together themselves into the original text?

Mingun · 2024-06-30T18:03:46Z

Yes, you are correct. But that does not mean that users will need to construct the complete text themselves. In the next PR I plan to rename Reader to RawReader and make it return borrow-only RawEvents, and in another PR introduce a new Reader which will automatically merge all consecutive Text, CData and GeneralRef events. This should be much more convenient for the average user. RawReader will only be needed for very fine control.
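The merging step described here could look roughly like the sketch below. The `RawEvent` and `Event` enums and the `merge_text` function are hypothetical stand-ins, not the planned API, and the sketch assumes replacement text is plain text (a real implementation would also have to handle replacement text that contains markup).

```rust
/// Hypothetical low-level events, one per syntactic piece of the input.
#[derive(Debug, PartialEq)]
enum RawEvent<'a> {
    Start(&'a str),
    End(&'a str),
    Text(&'a str),
    CData(&'a str),
    /// Content between `&` and `;`.
    GeneralRef(&'a str),
}

/// Hypothetical high-level events with text already merged.
#[derive(Debug, PartialEq)]
enum Event<'a> {
    Start(&'a str),
    End(&'a str),
    Text(String),
}

/// Emits the accumulated text, if any, as a single `Event::Text`.
fn flush<'a>(out: &mut Vec<Event<'a>>, pending: &mut String) {
    if !pending.is_empty() {
        out.push(Event::Text(std::mem::take(pending)));
    }
}

/// Merges runs of Text, CData and GeneralRef into one Text event.
/// `resolve` maps reference content to replacement text; unresolved
/// references are kept verbatim as `&name;`.
fn merge_text<'a>(
    raw: impl IntoIterator<Item = RawEvent<'a>>,
    resolve: impl Fn(&str) -> Option<String>,
) -> Vec<Event<'a>> {
    let mut out = Vec::new();
    let mut pending = String::new();
    for ev in raw {
        match ev {
            RawEvent::Text(t) | RawEvent::CData(t) => pending.push_str(t),
            RawEvent::GeneralRef(name) => match resolve(name) {
                Some(text) => pending.push_str(&text),
                None => {
                    pending.push('&');
                    pending.push_str(name);
                    pending.push(';');
                }
            },
            RawEvent::Start(name) => {
                flush(&mut out, &mut pending);
                out.push(Event::Start(name));
            }
            RawEvent::End(name) => {
                flush(&mut out, &mut pending);
                out.push(Event::End(name));
            }
        }
    }
    flush(&mut out, &mut pending);
    out
}
```

Any non-text event flushes the pending buffer, so the consumer sees one Text event per run, which is close to what the current Reader produces.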

Because the renames affect a great many places, I want to do them in a separate PR to reduce noise in the PR with the new Reader.

Borrow-only, reader-only events will also be useful because I plan to add an offset member to them to track the event position in the stream. When you construct an event for writing you obviously do not have a position, and I think it is better not to have a dummy value for it, so that you cannot mistakenly use a writer event in a reading context and get the wrong position.

Because the new Reader will have a stack of RawReaders (in the same way as demonstrated in the custom_entities example), it will be simple to recreate readers when we need to change the encoding, so I think that #158 is very close to being resolved.
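The reader-stack idea can be illustrated with a toy, standard-library-only sketch. Everything here (`Token`, `Frame`, `StackReader`, `MAX_DEPTH`) is hypothetical and much simpler than the custom_entities example: markup is not parsed, and a `&` without any later `;` is treated as plain text.

```rust
use std::collections::HashMap;

/// Depth cap, since this sketch does not detect recursive entities.
const MAX_DEPTH: usize = 16;

/// Hypothetical token produced by the toy reader.
#[derive(Debug, PartialEq)]
enum Token {
    /// Literal text (markup is passed through unparsed).
    Text(String),
    /// A reference that the entity map could not resolve.
    UnresolvedRef(String),
}

/// One level of the stack: an input and the read position inside it.
struct Frame<'a> {
    input: &'a str,
    pos: usize,
}

struct StackReader<'a> {
    entities: &'a HashMap<&'a str, &'a str>,
    stack: Vec<Frame<'a>>,
}

impl<'a> StackReader<'a> {
    fn new(input: &'a str, entities: &'a HashMap<&'a str, &'a str>) -> Self {
        Self { entities, stack: vec![Frame { input, pos: 0 }] }
    }

    /// Returns the next token, reading from the innermost open input.
    fn next_token(&mut self) -> Option<Token> {
        loop {
            let depth = self.stack.len();
            let frame = self.stack.last_mut()?;
            let input: &'a str = frame.input;
            let rest = &input[frame.pos..];
            if rest.is_empty() {
                // Replacement text exhausted: resume the outer input.
                self.stack.pop();
                continue;
            }
            if let Some(after_amp) = rest.strip_prefix('&') {
                let Some(end) = after_amp.find(';') else {
                    // No closing `;`: emit the remainder as text.
                    frame.pos = input.len();
                    return Some(Token::Text(rest.to_string()));
                };
                let name = &after_amp[..end];
                frame.pos += end + 2; // skip `&`, the name and `;`
                match self.entities.get(name).copied() {
                    Some(text) if depth < MAX_DEPTH => {
                        // Read the replacement text before continuing.
                        self.stack.push(Frame { input: text, pos: 0 });
                        continue;
                    }
                    _ => return Some(Token::UnresolvedRef(name.to_string())),
                }
            }
            let end = rest.find('&').unwrap_or(rest.len());
            frame.pos += end;
            return Some(Token::Text(rest[..end].to_string()));
        }
    }
}
```

Because each frame owns its own position, a frame could in principle be replaced by a fresh reader with different settings, which is the property the encoding change would rely on.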

Mingun · 2024-07-01T17:26:35Z

src/reader/mod.rs

+                        Ok((_, false)) => {
+                            // We want to report error at `&`, but offset was increased,
+                            // so return it back (-1 for `&`)
+                            $self.state.last_error_offset = start - 1;
+                            Err(Error::Syntax(SyntaxError::UnclosedReference))
+                        }


I found code in ruffle (ruffle-rs/ruffle#10471, from @Aaron1011) that will break (the custom_unescape function), because currently a dangling & always returns SyntaxError::UnclosedReference. I think I should add a new configuration option for this here.
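For illustration, the behaviour such an option would toggle can be sketched as follows. This is a standalone toy, not the parser code: `Piece`, `split_refs`, and this `UnclosedReference` struct are hypothetical, and the sketch assumes a reference counts as closed only when a `;` appears before the next `&`.

```rust
/// A piece of character data (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Piece<'a> {
    Text(&'a str),
    /// Content between `&` and `;`.
    Ref(&'a str),
}

/// Error carrying the byte offset of the offending `&`.
#[derive(Debug, PartialEq)]
struct UnclosedReference {
    offset: usize,
}

/// Splits text into literal runs and references. With `allow_dangling_amp`
/// set, an `&` without a closing `;` stays part of the surrounding text;
/// otherwise it is reported as an error positioned at the `&`.
fn split_refs(
    input: &str,
    allow_dangling_amp: bool,
) -> Result<Vec<Piece<'_>>, UnclosedReference> {
    let mut pieces = Vec::new();
    let mut text_start = 0; // start of the text run not yet emitted
    let mut search_from = 0; // where to look for the next `&`
    while let Some(rel) = input[search_from..].find('&') {
        let amp = search_from + rel;
        let after = &input[amp + 1..];
        // Find whichever comes first: the closing `;` or another `&`.
        let stop = after.find(|c: char| c == ';' || c == '&');
        match stop {
            Some(i) if after.as_bytes()[i] == b';' => {
                if amp > text_start {
                    pieces.push(Piece::Text(&input[text_start..amp]));
                }
                pieces.push(Piece::Ref(&after[..i]));
                text_start = amp + 1 + i + 1;
                search_from = text_start;
            }
            // Dangling `&`: keep it in the current text run.
            _ if allow_dangling_amp => search_from = amp + 1,
            _ => return Err(UnclosedReference { offset: amp }),
        }
    }
    if text_start < input.len() {
        pieces.push(Piece::Text(&input[text_start..]));
    }
    Ok(pieces)
}
```

In strict mode `split_refs("AT&T rocks", false)` fails with the offset of the `&`, while the lenient mode returns the whole string as a single text piece.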

Mingun · 2024-07-21T04:36:50Z

@dralley, what do you think about this?

dralley · 2024-07-28T21:50:24Z

Changelog.md

@@ -24,6 +24,9 @@ XML specification. See the updated `custom_entities` example!

 - [#766]: Allow to parse resolved entities as XML fragments and stream events from them.
 - [#766]: Added new event `Event::GeneralRef` with content of [general entity].
+- [#766]: Added new configuration option `allow_dangling_amp` which allows to have
+  a `&` not followed by `;` in the textual data which is required for some applications
+  for compatibility reasons.


Which applications?

I meant the case from #719 here.

dralley · 2024-10-12T18:06:11Z

Sorry for not reviewing this promptly. Once the current set of PRs are merged, can you rebase?

Mingun · 2024-10-19T13:49:10Z

@dralley, you can review this. Initially I planned not to merge it until I had implemented the other changes that rework the parser (in follow-up PRs), to reduce the number of releases with breaking changes. But because we already have plenty of breaking changes, I think it can be merged and included in the next release.

dralley · 2024-10-27T04:47:35Z

examples/custom_entities.rs

-    <!ENTITY msg "hello world" >
-    ]>
-    <test label="&msg;">&msg;</test>
+struct MyReader<'i> {


You mentioned a future PR that would implement this functionality on Reader while leaving RawReader for a lower-level event stream. I presume that at that point this example would basically become trivial, as opposed to... this?

My point just being that the current example as-is is a bit much to expect people to implement or copy and paste, and I just want to check my understanding that it's not a long-term solution.

I have not really decided yet whether to leave this example as it is, to demonstrate how the reader stack can be implemented if for some reason the standard solution does not work, or to rewrite it for the new API. If it remains as it is, it will mention the standard way.

dralley · 2024-10-27T05:05:18Z

benches/macrobenches.rs

@@ -54,7 +54,7 @@ fn parse_document_from_str(doc: &str) -> XmlResult<()> {
                }
            }
            Event::Text(e) => {
-                criterion::black_box(e.unescape()?);
+                criterion::black_box(e.decode()?);


About Reader returning merged events in the future - does that mean that the .decode() would no longer serve a functional purpose if there was no "decoding" (in the text encoding, utf-8 etc. sense) to do? (because the entities are expanded already).

And would there be any way to, say, return the original raw unexpanded XML between two tags from that wrapper, or would that be impossible without dropping to the RawReader level?

Yes, I think the decode method will be gone.

And would there be any way to, say, return the original raw unexpanded XML between two tags from that wrapper, or would that be impossible without dropping to the RawReader level?

I think it would be possible to implement that, but only if someone requests it; I think this is a niche feature. The API could be a special method that you call instead of read_event() to get the unparsed content.

…onstruction in a text failures (16): serde-de (9): borrow::escaped::element borrow::escaped::top_level resolve::resolve_custom_entity trivial::text::byte_buf trivial::text::bytes trivial::text::string::field trivial::text::string::naked trivial::text::string::text xml_schema_lists::element::text::string serde-migrated (1): test_parse_string serde-se (5): with_root::char_amp with_root::char_gt with_root::char_lt with_root::str_escaped with_root::tuple --doc (1): src\de\resolver.rs - de::resolver::EntityResolver (line 13)

…xpanded entities

Text events produces by the Reader can not contain escaped data anymore, all such data is represented by the Event::GeneralRef

Fixed (18): serde-de (9): borrow::escaped::element borrow::escaped::top_level resolve::resolve_custom_entity trivial::text::byte_buf trivial::text::bytes trivial::text::string::field trivial::text::string::naked trivial::text::string::text xml_schema_lists::element::text::string serde-migrated (1): test_parse_string serde-se (5): with_root::char_amp with_root::char_gt with_root::char_lt with_root::str_escaped with_root::tuple --doc (3): src\de\resolver.rs - de::resolver::EntityResolver (line 13)

dralley

Looks reasonable.

Initially I planned not to merge it until I had implemented the other changes that rework the parser (in follow-up PRs), to reduce the number of releases with breaking changes. But because we already have plenty of breaking changes, I think it can be merged and included in the next release.

I personally would prefer to see the Reader / RawReader split happen prior to the next release, or even in this PR in subsequent commits, to reduce the amount of massively breaking API changes going on. I'd rather not split the changes across many releases.

From a purely psychological level I'd be deeply annoyed by having to change code that uses Reader to handle the existence of references throughout the returned event stream, and then subsequently needing to change it back later to something closer to the original approach. That particular type of churn (needing to make a change as a user and then un-make it later) is best avoided.

Mingun · 2024-11-10T10:28:23Z

Totally agree. OK, then I'll merge it when the follow-up PR is ready.

Mingun mentioned this pull request Jun 23, 2024

Convert some unit tests to integration tests #767

Merged

Mingun force-pushed the entity-ref branch from f2eed43 to 908ac15 Compare June 23, 2024 12:59

This was referenced Jun 23, 2024

Remove requirement to have Reader when decoding attributes and write anything converted to Event #760

Merged

Disabling check_end_names is not effective since v0.32 #770

Closed

Fix incorrect end position in read_to_end and read_text #773

Merged

Mingun force-pushed the entity-ref branch 2 times, most recently from da2a802 to 6d9a6ab Compare June 30, 2024 12:59

Mingun marked this pull request as ready for review June 30, 2024 12:59

Mingun requested a review from dralley June 30, 2024 13:00

Mingun force-pushed the entity-ref branch from 6d9a6ab to a5ab870 Compare June 30, 2024 13:17

Mingun mentioned this pull request Jun 30, 2024

How would I parse character references as literal bytes and not codepoints? #667

Open

Mingun force-pushed the entity-ref branch from a5ab870 to db000c1 Compare June 30, 2024 19:08

Mingun commented Jul 1, 2024

Mingun marked this pull request as draft July 2, 2024 17:23

This was referenced Jul 5, 2024

Allow attributes in the Event::End and fix .error_position() #780

Merged

Start CDATA section only after uppercase <![CDATA[ #781

Merged

Mingun force-pushed the entity-ref branch 2 times, most recently from b307997 to 51676d2 Compare July 8, 2024 16:44

Mingun marked this pull request as ready for review July 8, 2024 16:44

Mingun force-pushed the entity-ref branch from 51676d2 to 8283ea4 Compare July 8, 2024 17:00

Mingun mentioned this pull request Jul 11, 2024

high frequency of API-breaking releases #782

Closed

Mingun force-pushed the entity-ref branch from 8283ea4 to eb90e9f Compare July 23, 2024 17:46

dralley reviewed Jul 28, 2024

Mingun mentioned this pull request Aug 29, 2024

Expose a way to iterate over DeEvents #799

Open

Mingun force-pushed the entity-ref branch from eb90e9f to 286b259 Compare October 18, 2024 20:56

Mingun force-pushed the entity-ref branch 2 times, most recently from 00506c0 to c8fefad Compare October 20, 2024 20:01

dralley reviewed Oct 27, 2024

Mingun and others added 6 commits October 28, 2024 03:06

Add tests for XmlSource::read_text

a6d486e

Update custom_entities example to show how to process events from e…

08ec03a

…xpanded entities

Replace BytesText::unescape and unescape_with by decode

094a88e

Text events produces by the Reader can not contain escaped data anymore, all such data is represented by the Event::GeneralRef

Add allow_dangling_amp configuration option and allow dangling &

0631d47

Mingun force-pushed the entity-ref branch from c8fefad to 0631d47 Compare October 27, 2024 22:35

dralley approved these changes Nov 8, 2024
